## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
This tidy dataset contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating from 0 (very bad) to 10 (very excellent).
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The middle 50% of the wines have fixed acidity between 7.10 and 9.20.
Volatile acidity shows a bimodal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The middle 50% of the wines have volatile acidity between 0.39 and 0.64.
Residual sugar has a right-skewed, long-tailed distribution: most values sit near the low end, with a long tail toward high values.
Chlorides show the same right-skewed, long-tailed shape.
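One quick way to check the direction of a skew is to compare the mean with the median, since a long tail toward high values pulls the mean above the median. The sketch below uses figures copied from the summary output above; the `stats` data frame and its column names are only illustrative.

```r
# Mean, median and maximum copied from the summary() output above.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides"),
  mean     = c(2.539, 0.08747),
  median   = c(2.200, 0.07900),
  max      = c(15.500, 0.61100)
)

# A mean above the median suggests a tail toward high values.
stats$mean_above_median <- stats$mean > stats$median

# How far the maximum lies above the median, as a multiple of the median.
stats$max_over_median <- round(stats$max / stats$median, 1)

print(stats)
```

For both variables the mean exceeds the median and the maximum is roughly seven times the median, which is what a long right tail looks like in summary form.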
Total sulfur dioxide has some high outliers: the maximum is 289, while the third quartile is only 62.
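A common rule of thumb, which is an assumption here and not necessarily the criterion used for the plot, flags values beyond 1.5 times the interquartile range from the quartiles. A minimal sketch using the quartiles of total sulfur dioxide from the summary output above:

```r
# Quartiles and maximum of total.sulfur.dioxide, copied from summary() above.
q1 <- 22
q3 <- 62
max_value <- 289

iqr <- q3 - q1                   # interquartile range: 40
upper_fence <- q3 + 1.5 * iqr    # values above this are flagged: 122
lower_fence <- q1 - 1.5 * iqr    # values below this are flagged: -38

print(c(lower = lower_fence, upper = upper_fence))
print(max_value > upper_fence)   # TRUE: the maximum is far beyond the fence
```

On the full data frame, the same fences could be computed from `quantile(wqr$total.sulfur.dioxide, c(0.25, 0.75))`.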
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The middle 50% of the wines have a density between 0.9956 and 0.9978.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The middle 50% of the wines have a pH between 3.210 and 3.400.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Most of the wines have a quality of 5 or 6.
There are 1,599 red wines in the dataset with 13 features (X, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). X identifies each wine, and quality represents how good the wine is. Strictly speaking, X is an unordered factor and quality is an ordered factor, but I treated both as numerical variables for convenience. All other features represent chemical properties of the wine.
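To illustrate the difference between the two treatments of quality, here is a minimal sketch using the first ten ratings shown in the `str()` output at the top. The level range 3 to 8 is taken from the observed minimum and maximum, and the name `quality_f` is hypothetical.

```r
# First ten quality ratings, copied from the str() output.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Ordered factor: keeps the ranking but does not assume equal spacing.
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

print(table(quality_f))            # counts per rating, including empty levels
print(quality_f[1] < quality_f[4]) # comparisons work on ordered factors

# Numeric treatment: allows means and correlations.
print(mean(quality))
```

The numeric version is what makes the correlation coefficients and `lm` smoothers later in the report possible, while the factor version is the natural input for box plots grouped by rating.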
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Other observations:

* Wines with quality 5 or 6 are most common.
* The median wine quality is 6.
* Most wines have a quality of 5 or better.
* About 75% of wines have a quality of 6 or worse.
* The worst and best qualities in the data set are 3 and 8, respectively.

The main feature in the data set is quality. I’d like to determine which features are best for predicting wine quality. I suspect some combination of the other variables can be used to build a predictive model for wine quality.
The primary wine characteristics are sweetness, acidity, tannin, alcohol, and body. Residual sugar, fixed and volatile acidity, alcohol, and density determine those characteristics. I expect these variables to be the ones most closely related to wine quality.
I created a variable for the total acidity using the volatile and the fixed acids.
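The report does not show how the total acidity variable was built. One plausible construction, assumed here purely for illustration, is the plain sum of the two columns. The `wine` data frame below holds the first ten rows from the `str()` output, and the column name `total.acidity` is hypothetical.

```r
# First ten rows of the two acidity columns, copied from the str() output.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Assumed definition: total acidity as the sum of fixed and volatile acidity.
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity

print(head(wine, 3))
```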
Volatile acidity shows a bimodal distribution.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00 -0.26 0.67
## volatile.acidity -0.26 1.00 -0.55
## citric.acid 0.67 -0.55 1.00
## residual.sugar 0.11 0.00 0.14
## chlorides 0.09 0.06 0.20
## free.sulfur.dioxide -0.15 -0.01 -0.06
## total.sulfur.dioxide -0.11 0.08 0.04
## density 0.67 0.02 0.36
## pH -0.68 0.23 -0.54
## sulphates 0.18 -0.26 0.31
## alcohol -0.06 -0.20 0.11
## quality 0.12 -0.39 0.23
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.11 0.09 -0.15
## volatile.acidity 0.00 0.06 -0.01
## citric.acid 0.14 0.20 -0.06
## residual.sugar 1.00 0.06 0.19
## chlorides 0.06 1.00 0.01
## free.sulfur.dioxide 0.19 0.01 1.00
## total.sulfur.dioxide 0.20 0.05 0.67
## density 0.36 0.20 -0.02
## pH -0.09 -0.27 0.07
## sulphates 0.01 0.37 0.05
## alcohol 0.04 -0.22 -0.07
## quality 0.01 -0.13 -0.05
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity -0.11 0.67 -0.68 0.18 -0.06
## volatile.acidity 0.08 0.02 0.23 -0.26 -0.20
## citric.acid 0.04 0.36 -0.54 0.31 0.11
## residual.sugar 0.20 0.36 -0.09 0.01 0.04
## chlorides 0.05 0.20 -0.27 0.37 -0.22
## free.sulfur.dioxide 0.67 -0.02 0.07 0.05 -0.07
## total.sulfur.dioxide 1.00 0.07 -0.07 0.04 -0.21
## density 0.07 1.00 -0.34 0.15 -0.50
## pH -0.07 -0.34 1.00 -0.20 0.21
## sulphates 0.04 0.15 -0.20 1.00 0.09
## alcohol -0.21 -0.50 0.21 0.09 1.00
## quality -0.19 -0.17 -0.06 0.25 0.48
## quality
## fixed.acidity 0.12
## volatile.acidity -0.39
## citric.acid 0.23
## residual.sugar 0.01
## chlorides -0.13
## free.sulfur.dioxide -0.05
## total.sulfur.dioxide -0.19
## density -0.17
## pH -0.06
## sulphates 0.25
## alcohol 0.48
## quality 1.00
Fixed acidity and volatile acidity have strong positive (0.67) and negative (-0.55) correlations with citric acid, respectively.
pH has a strong negative correlation with fixed acidity (-0.68) and citric acid (-0.54), but not with volatile acidity (0.23).
Fixed acidity and alcohol have substantial positive (0.67) and negative (-0.50) correlations with density, respectively.
Most of the variables do not seem to be strongly correlated with quality, but alcohol (0.48) and volatile acidity (-0.39) have considerable positive and negative correlations with quality, respectively.
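A matrix like the one above can be produced with `cor` and `round`. The sketch below is not necessarily the exact call used in the report; it runs on a subset of columns for the first ten rows from the `str()` output, so its numbers will differ from the full-data matrix. The `wine` name is an assumption.

```r
# First ten rows of selected columns, copied from the str() output.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlations between every pair of columns, rounded to 2 digits.
cm <- round(cor(wine, method = "pearson"), 2)
print(cm)
```

On the full data, dropping the identifier column first (for example `cor(wqr[, names(wqr) != "X"])`) would avoid correlating the row index with the chemical properties.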
ggplot(wqr, aes(x=fixed.acidity, y=pH)) +
geom_point(alpha = 0.3, size = 2) +
stat_smooth(method='lm')
The strongest correlation in this data set is between fixed acidity and pH (-0.68). Higher acidity means lower pH, and the graph is consistent with this.
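The line drawn by `stat_smooth(method='lm')` is an ordinary least-squares fit, which can also be inspected directly with `lm`. This sketch uses only the first ten rows from the `str()` output (the `wine` data frame is an assumption), so the coefficients are illustrative and not those of the full data set.

```r
# First ten rows of fixed acidity and pH, copied from the str() output.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Least-squares line: pH as a linear function of fixed acidity.
fit <- lm(pH ~ fixed.acidity, data = wine)

# A negative slope means pH falls as fixed acidity rises.
print(coef(fit))
```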
ggplot(wqr, aes(x=fixed.acidity, y=citric.acid)) +
geom_point(alpha = 0.3, size = 2) +
stat_smooth(method='lm')
ggplot(wqr, aes(x=fixed.acidity, y=density)) +
geom_point(alpha = 0.3, size = 2) +
stat_smooth(method='lm')
Fixed acidity also has strong positive correlations with citric acid and density (0.67 for both).
ggplot(wqr, aes(x=citric.acid, y=volatile.acidity)) +
geom_point(alpha = 0.3, size = 2) +
stat_smooth(method='lm')
ggplot(wqr, aes(x=citric.acid, y=pH)) +
geom_point(alpha = 0.3, size = 2) +
stat_smooth(method='lm')
The citric acid has considerable negative correlations with volatile acidity and pH.
ggplot(wqr, aes(x=alcohol, y=density)) +
geom_point(alpha = 0.3, size = 2) +
stat_smooth(method='lm')
Alcohol and density also show a considerable negative correlation (-0.50).
ggplot(wqr, aes(x=quality, y=volatile.acidity)) +
geom_jitter(alpha = 0.3, size = 2) +
stat_smooth(method='lm')
ggplot(wqr, aes(x=quality, y=alcohol)) +
geom_jitter(alpha = 0.3, size = 2) +
stat_smooth(method='lm')
Two variables, volatile acidity and alcohol, have considerable correlations with quality.
ggplot(wqr, aes(x=factor(quality), y=alcohol)) +
geom_boxplot(notch=FALSE)
Among medium- and high-quality wines, alcohol content appears to increase with quality.
ggplot(wqr, aes(x=factor(quality), y=volatile.acidity)) +
geom_boxplot(notch=FALSE)
The trend between volatile acidity and quality is clear: better-quality wines have less volatile acidity.
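The centre line of each box is the group median, which can be tabulated alongside the plots. A minimal sketch on the first ten rows from the `str()` output follows; the `wine` data frame and the `med_*` names are assumptions, and ten rows are too few to reproduce the trend seen in the full data.

```r
# First ten rows of the relevant columns, copied from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Median of each variable within each quality rating.
med_va  <- sapply(split(wine$volatile.acidity, wine$quality), median)
med_alc <- sapply(split(wine$alcohol, wine$quality), median)

print(rbind(volatile.acidity = med_va, alcohol = med_alc))
```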
Quality correlates with alcohol and volatile acidity.
Citric acid is one of the main components of fixed acidity. As a result, the two have a strong positive correlation.
More acid means a lower pH. Therefore, fixed acidity and citric acid correlate negatively with pH.
A wine with more volatile acidity tends to have less citric acid.
A wine with more fixed acidity tends to be denser, while a wine with more alcohol tends to be less dense.
Fixed acidity is strongly and positively correlated with citric acid and density. Citric acid may therefore be able to stand in for fixed acidity and density, and it may even estimate wine quality better: its correlation with quality is 0.23, against 0.12 for fixed acidity and -0.17 for density.
ggplot(wqr, aes(x=alcohol, y=volatile.acidity)) +
geom_point(alpha = 0.5, size = 2, position = 'jitter', aes(color=quality)) +
scale_color_gradient2(midpoint=mean(wqr$quality), low="blue", mid="white",
high="red", space ="Lab" )
ggplot(wqr, aes(x=alcohol, y=1/volatile.acidity)) +
geom_point(alpha = 0.5, size = 2, position = 'jitter', aes(color=quality)) +
scale_color_gradient2(midpoint=mean(wqr$quality), low="blue", mid="white",
high="red", space ="Lab" )
cor.test(wqr$alcohol - wqr$volatile.acidity, wqr$quality)
##
## Pearson's product-moment correlation
##
## data: wqr$alcohol - wqr$volatile.acidity and wqr$quality
## t = 24.166, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4806396 0.5524750
## sample estimates:
## cor
## 0.5174684
cor.test(wqr$alcohol - wqr$volatile.acidity^3 + wqr$citric.acid - wqr$pH + wqr$sulphates, wqr$quality)
##
## Pearson's product-moment correlation
##
## data: wqr$alcohol - wqr$volatile.acidity^3 + wqr$citric.acid - wqr$pH + wqr$sulphates and wqr$quality
## t = 26.927, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5241297 0.5916089
## sample estimates:
## cor
## 0.5587935
cor.test(wqr$fixed.acidity+wqr$citric.acid-wqr$density, wqr$quality)
##
## Pearson's product-moment correlation
##
## data: wqr$fixed.acidity + wqr$citric.acid - wqr$density and wqr$quality
## t = 5.6008, df = 1597, p-value = 2.509e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0903879 0.1865460
## sample estimates:
## cor
## 0.1387941
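The `cor.test` calls above combine variables with fixed weights of 1 or -1, so each variable's influence depends on its units. A linear model estimates the weights instead. The following is only a sketch and not the analysis behind the results above: it fits `lm` on the ten rows shown in the `str()` output, and the `wine`, `r_model` and `r_fixed` names are assumptions. On the full data one would pass `wqr` as the `data` argument.

```r
# First ten rows of the relevant columns, copied from the str() output.
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Let least squares choose the weights for the two predictors.
fit <- lm(quality ~ alcohol + volatile.acidity, data = wine)
print(coef(fit))

# Correlation of the model's fitted values with quality, compared with
# the fixed-weight combination used in the first cor.test call.
r_model <- cor(fitted(fit), wine$quality)
r_fixed <- cor(wine$alcohol - wine$volatile.acidity, wine$quality)
print(c(model = r_model, fixed_weights = r_fixed))
```

Within the sample, the fitted values of a least-squares model correlate with the response at least as strongly as any fixed-weight combination of the same predictors. Ten rows say nothing about the full data set; they only show the mechanics.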